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Using a whole-genome-sequencing approach to explore germplasm resources can serve as 
an important strategy for crop improvement, especially in investigating wild accessions 
that may contain useful genetic resources that have been lost during the domestication 
process. Here we sequence and assemble a draft genome of wild soybean and construct a 
recombinant inbred population for genotyping-by-sequencing and phenotypic analyses to 
identify multiple QTLs relevant to traits of interest in agriculture. We use a combination of 
de novo sequencing data from this work and our previous germplasm re-sequencing data to 
identify a novel ion transporter gene, GmCHXI, and relate its sequence alterations to salt 
tolerance. Rapid gain-of-function tests show the protective effects of GmCHXI towards salt 
stress. This combination of whole-genome de novo sequencing, high-density-marker QTL 
mapping by re-sequencing and functional analyses can serve as an effective strategy to unveil 
novel genomic information in wild soybean to facilitate crop improvement. 
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Advances in next-generation sequencing have driven a 
revolution in genomic analyses and their use for crop 
improvement, which is essential given the rapidly 
growing human population. There are several whole-genome- 
sequencing projects for major crops aiming to unveil novel 
insights in genomic evolution and diversity, including the 
effects of domestication and human selection 1-3 , and to apply 
whole -genome -sequencing data to actual crop improvement 
programmes. The genomic information in wild accessions 
complemented with detailed studies, and characterizations 
of genetic populations can help identify useful quantitative 
trait loci (QTLs) and candidate causal genes, some of which 
have already been successfully introgressed into cultivated 
accessions, for example, rice 4 , maize 5 and wheat 3 , for crop 
improvement. 

Soybean was domesticated in China ~ 6,000-9,000 years ago 6 , 
and, due to bottlenecks and human selection, cultivated soybeans 
have much lower genetic diversity than their wild counterparts 2,7 . 
This reduced variation has potentially resulted in the loss of some 
genes important for the adaptation to different environments. 
Wild soybeans that exhibit a high allelic diversity may therefore 
be a resource for genes for adapting to certain environmental 
conditions to be re- introduced into domesticated soybeans via 
breeding. This can be done since there is no reproductive barrier 
between wild and cultivated soybeans. 

Here we formulate a strategy to combine the whole-genome- 
sequencing approaches with de novo sequencing of a wild 
soybean genome and construction of genotyping-by-sequencing- 
based genetic map using a recombinant inbred (RI) population to 
assist in the genetic studies of important QTLs in the wild 
soybean genome. Further analyses with reference to the 
previously obtained re- sequencing data of germplasms 2 help in 
identifying GmCHXl, a candidate causal gene for salt tolerance. 

Results 

De novo assembly and annotation of a wild soybean genome. We 

performed whole-genome de novo sequencing (Supplementary 
Table 1) on W05, a wild soybean accession from China that our 
group previously characterized as having high tolerance to salt and 
is genetically and phylogenetically distinct from cultivated soy- 
beans 2 . Its genome size is estimated at 1.17 Gb by K-mer statistics 
(Supplementary Fig. 1), which is close to the estimated 1.12 Gb 
size of the cultivated soybean genome Williams 82 (ref. 8). 
SOAPdenovo 9 software is used to build a 868-Mb assembly with 
sequencing reads (Table 1 and Supplementary Table 2). The 3,281 
largest scaffolds represent 90% of the assembled genome. More 
than 94% of the 742,658,772 cleaned reads from short insertion 
size libraries can be mapped to the 868 -Mb genome, with 
82% exhibiting proper pair-end relationships (Supplementary 
Table 3). 

Wild soybean genes are annotated using a combined gene 
model prediction strategy, integrating ab initio modelling, 
homology searching, expression- sequence tags and also using 
RNA-Seq data from transcriptomes generated for this project 
(trifoliate and primary leaves, and roots of young W05 seedlings; 
Supplementary Tables 4 and 5). To assess our annotation, we 
perform de novo assembly of the transcriptome raw data using 
Trinity 10 , and find a match for the majority of the assembled 
Trinity contigs in the annotated genome (Supplementary 
Table 6). The Core Eukaryotic Genes Mapping Approach 11 
further shows that the W0 5 (wild; this work) and Williams 
82 (cultivated; 8 ) genomes are 86.7% and 90.3% complete, 
respectively (Supplementary Table 7). (See Supplementary 
Figs 2 and 3 and Supplementary Tables 8 and 9 for detailed 
annotation information of the W05 genome.) 



RI population re-sequencing. We construct a RI population 
by crossing the de novo-sequenced W05 with the re-sequenced 
cultivated soybean accession C08, a close relative of 
Williams 82, which has phenotypes different from those of 
W05 (Supplementary Table 10). A core panel of 96 RI lines is 
selected for ~ 1 x depth low-level whole-genome re-sequencing 
(Supplementary Table 11). After filtering, we identify 1,798,504 
high-quality single-nucleotide polymorphisms (SNPs) that are 
distributed preferentially in the gene-harboring chromosome 
distal ends (Supplementary Fig. 4) owing to the low frequency 
of meiotic recombinant events at the peri-centrimeric 
regions 12 . Heterozygosity of each RI line is low, reflecting the 
successful construction of a highly homozygous RI population 
(Supplementary Table 11). 

Recombinant event determination. We generate a bin map and 
determine the recombinant breakpoints of this RI population 
using the Maximum Parsimonious Inference of Recombination 
package 13 . Two thousand seven hundred and fifty- seven bin 
markers along the 20 chromosomes are identified, with their 
physical coordinates referenced to the Williams 82 genome 
(assembly version Glymal.l) (Fig. 1 and Supplementary Data 1). 
The total genetic length of this bin map is 2,992.0 cM 
in the Kosambi mapping function, with a mean interval 
between markers of 0.94 cM and an average physical bin 
length of 306.0 Kb. More than 60% of the recombinant 
breakpoints are located at the chromosome distal ends, 
consistent with a recent report 14 . Traditional simple sequence 
repeat markers are used to verify the quality and accuracy of the 
bin map. More than 94% of the PCR bands from experimental 



Table 1 | Draft genome assembly and annotation of the wild 


soybean accession W05. 






Genome 


N50 (Kb/no.) 


N90 


Total length 


assembly 




(Kb/no.) 


(Mb) 


Contigs 


24.2 (8,897) 


3.4 (40,973) 


808.7 


Scaffolds 


401.3 (442) 


43.8 (3,281) 


868.0 


Genome 


Total no. 


Function 


Average CDS 


annotation 




assigned 


length (bp) 


Protein-codin 


g 52,395 


49,560 


1,083.9 


genes 












Copies 


Length (bp) 


Non-coding 


rRNAs 


100 


13,312 


RNAs 










tRNAs 


864 


64,898 




miRNAs 


376 


45,953 




snRNAs 


437 


46,398 






Length (bp) 


Percentage (%) 


Transposable 


DNA 


57,532,948 


6.63 


elements 


transposons 








LTR 


268,111,653 


30.89 




LINE 


16,640,627 


1.92 




SINE 


1,204,272 


0.14 




Low_complexity 


3,218,693 


0.37 




Simple repeat 


12,984,986 


1.50 




Satellite 


1,875,157 


0.22 




Other 


346,855 


0.04 




Novel/unknown 


14,841,785 


1.71 




Total 


376,756,976 


43.41 


CDS, average coding 


sequence; LINE, long interspersed nuclear elements; LTR, long terminal 


repeat; SINE, short in 


terspersed nuclear elements. 
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Figure 1 | QTL identification using bin map of a Rl population, (a) Recombinant bin map of a core panel of 96 Rl lines. Red and blue indicate parental 
genotypes from W05 and C08, respectively, (b) LOD score distribution of 11 agronomic traits with major QTLs. Maximum LOD score of each 
major QTL is indicated next to the peak. Different line colours indicate data collected in different years (yr08, 2008; yr09, 2009; yrlO, 2010; yrll, 2011; 
and yr12, 2012). 



screening (total = 61,856) share the same genotype with the 
corresponding bins. 

High-density-marker QTL mapping of soybean traits. We 

record 18 agronomic traits from the core panel Rl lines and the 
two parents from different years (Supplementary Table 10 and 
Supplementary Fig. 5). Most of the phenotypic data are quanti- 
tative and exhibit a normal distribution (Supplementary Fig. 5). 
No major QTLs are found for seed oil content, seed protein 



content, 100-seed weight, leaf area, plant height, node number 
and branch number. However, we identify 15 major QTLs for 11 
other traits from the high-density marker QTL map using 
QTLCartographer (http://statgen.ncsu.edu/qtlcart/) (Table 2). 
Sharp peaks in the log-of-odds (LOD) score curves for each QTL 
are obtained, spanning 11 of the 20 chromosomes (Fig. lb). The 
LOD score cutoff for QTL identification ranges from 3.80 to 7.60, 
whereas the maximum LOD score for each trait ranges from 4.20 
to 43.60. The identified QTLs occupy a physical length of 176 Kb 
to 1.28 Mb (average: 682.0 Kb; median: 601.6 Kb), with the genetic 
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distances ranging from 0.5 to 7.9 cM (average 2.4 cM; median: 
2.0 cM). For all mapped traits, the LOD score distributions cal- 
culated from trait data collected in different years are generally 
consistent (Fig. lb). We map six QTLs for five of the 11 traits to 
genomic positions that coincide with QTLs in previously pub- 
lished reports (Table 2 and Supplementary Table 12), providing 
support for the accuracy of this map. We identify nine new QTLs 
for seven traits, including total seed antioxidant content, seed 
anthocyanin content, pod number per plant, seed number per 
plant, growth period, pod colour and trailing growth (Table 2 and 
Supplementary Table 12). 

We also examine the relationship among different QTLs. Three 
QTL clusters are identified on Chr08, Chrl3 and Chrl9 (Fig. 1 
and Table 2). Statistically significant correlations are observed 
among phenotypic characteristics mapped onto the same QTL 
cluster (Supplementary Table 13). Cluster 1 on Chr08 contains 
QTLs for two closely related traits: total seed anthocyanin content 
and seed coat colour. This makes sense because anthocyanin is a 
pigment that is mainly stored in the seed coat 15 . Cluster 2 on 
Chrl3 contains QTLs regulating typical traits that differentiate 
wild from cultivated soybeans: trailing growth, pod number per 
plant and seed number per plant 16 . The trailing growth character 
of wild soybean results in a relatively longer overall growth 
period, which may lead to higher number of pods and more seeds 
per plant. The location of this cluster does not overlap with the 
previously identified QTL for indeterminate growth 17 . Cluster 3 
on Chrl9 contains QTLs that control pod colour and total seed 
antioxidant content. The close relationship between pod colour 
and total seed antioxidant content can be explained by two 
possible scenarios: (i) the pathway controlling the accumulation 
of pod colour pigmentation is also involved in the accumulation 
of seed antioxidants; or (ii) the genes controlling the two traits are 
incidentally located in close proximity. 



Identifying a possible causal gene for salt tolerance. We focus 
our subsequent analyses on the salt tolerance locus we have 
identified in WO 5, which exhibits a much higher salt tolerance as 
compared with that of C08 (Fig. 2a). We find that salt tolerance is 



a dominant trait in our population. We identify a salt tolerance 
locus from W05 that spans a genomic region of 978 Kb and 
overlaps with the previously reported Ncl locus 18-21 . To better 
define this locus, we screen the remaining RI population and 
obtain two extreme groups (85 tolerant and 73 sensitive). 
Recombinant breakpoint analyses of the two groups using 
simple sequence repeat and SNP markers reduce the salt 
tolerance locus to a 3 88 -Kb region on Chr03, which contains 
43 predicted genes based on the annotated Williams 82 genome 
(Fig. 2b). 

When soybean plants are subjected to NaCl treatment, the 
Na + content and Na + /K + ratio in leaves are significantly lower 
in W05 compared with C08 (Fig. 2c), suggesting that ion 
transporters has an important role in salt tolerance. BLAST 
analysis of the 43 predicted genes reveals two genes (Gly- 
ma03g32890 and Glyma03g32900) that have a high degree of 
similarity to cation H + exchangers (CHXs) 22 . This 388-Kb 
region on Chr03 also has a duplicated segment on Chrl9, likely a 
result of the past whole-genome duplication, but it does not 
possess a salt tolerance QTL. The two CHX genes present in the 
QTL on Chr03 are either absent or truncated in the duplicated 
segment on Chrl9. 

Using the de novo sequencing data of W05, we identify a single 
scaffold containing the corresponding 388-Kb genomic region. 
Comparison of the genomic sequences of W05 and Williams 82 
shows that Williams 82 had a Tyl/copia retrotransposon 23 
inserted into exon 3 of the cation H + exchanger gene 
Glyma03g32900, but not in its counterpart Glysoja01g005509 in 
W05 (Fig. 2d). PCR amplification and Sanger sequencing verify 
that the salt- sensitive parent C08 also contains the same insertion 
as Williams 82 (Supplementary Fig. 6a). 

3'-RACE experiments show that Glysoja01g005509 in C08 
produces a truncated transcript (encoding only 376 amino-acid 
residues versus 811 residues in the full-length Glysoja01g005509 
in W05) (Fig. 3a). The salt- sensitive C08 appears to possess 
a loss-of-function variant of Glysoja01g005509 while real-time 
PCR analyses reveal that Glysoja01g005509 in W05 expresses 
a full-length transcript that is root specific (Supplementary 
Fig. 6b). 



Table 2 | Major QTLs identified. 



Agronomic traits 


LOD cutoff* 


Chr. no. 


Var (%) 


Start position 


QTL position 

End position Physical 
length (Kb) 


Genetic 
distance (cM) 


Overlapped QTL 1 
locus identified in 
previous studies 
(putative causal genes) 


Seed antioxidant content 


4.1 


19 


30.5 


37,424,658 


38,133,607 


709.0 


2.3 


NR 


Seed anthocyanin content 


7.6 


8 


68.13 


8,019,871 


8,609,212 


589.3 


1.1 


NR 


Pod no. per plant 


3.8 


13 


15.58 


38,761,376 


40,041,964 


1280.6 


2.6 


NR 


Seed no. per plant 


3.8 


13 


17.59 


38,761,376 


40,041,964 


1280.6 


2.6 


NR 






17 


11.00 


9,246,695 


10,131,683 


885.0 


7.9 


NR 


Growth period 


4.1 


6 


15.53 


19,518,049 


20,432,589 


914.5 


0.5 


E1 (E1) 






10 


15.44 


43,782,110 


45,008,063 


1226.0 


5.4 


E2 (GmG/o) 






11 


7.80 


11,167,414 


11,586,421 


419.0 


1.8 


NR 






12 


23.00 


5,236,748 


5,702,668 


465.9 


2.4 


NR 


Salt tolerance 


4.3 


3 


54.61 


40,204,091 


41,182,426 


978.3 


4.7 


Ncl 


Seed coat colour 


7.0 


8 


78.53 


7,902,342 


8,576,842 


674.5 


2.3 


1 (CHS gene cluster) 


Pod colour 


4.6 


19 


54.30 


37,367,542 


37,786,063 


418.52 


1.9 


NR 


Trailing growth 


4.0 


13 


28.43 


38,761,376 


39,342,212 


580.84 


1.9 


NR 


Leaf length/width ratio 


3.9 


20 


35.92 


34,864,294 


35,465,896 


601.6 


2.0 


in (Gm-JAGGEDl) 


Nodule no. per plant 


4.6 


16 


53.36 


36,484,899 


36,661,375 


176.5 


1.2 


Rj2 (Rj2) 



Chr., chromosome; LOD, log-of-odds; QTL, quantitative trait loci; NR, no report of major QTLs of the corresponding trait previously found in the same region; Var, variant. 
Grey boxes indicated previously found QTLs. 

*LOD score cutoff of major QTLs was determined by permutation tests (1,000 times; P<0.05). 

tDetailed information of the previous identified QTLs and their putative casual genes were listed in Supplementary Table 12. 
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Figure 2 | Identification of a putative causal gene in the salt tolerance locus, (a) W05 exhibits a higher salt tolerance than C08. Bottom photos: primary 
leaves, (b) A 978-Kb salt tolerance locus is first identified by LOD score (upper panel), and narrowed down to a 388-Kb region using SNP and 
simple sequence repeat markers (middle panel) and extreme groups (lower panel). SL, sensitive line; TL, tolerant line, (c) W05 accumulates less Na + in 
the leaves than C08 72 h after NaCI treatment. N = 4. Error bars = s.e.m., P<0.05. (d) Gene structure of GmCHXI in W05 and Williams 82. Arrows 
indicate primer positions for insertion and real-time PCR studies. FW, fresh weight. 



Phylogenetic analyses comparing Glysoja01g005509 to other 
monovalent cation/proton antiporters (CPA) that use mono- 
valent ions as their substrate 22 (Supplementary Fig. 7) show that 
the predicted gene product of Glysoja01g005509 belongs to the 
CHX clade of this superfamily. We name Glysoja01g005509 
GmCHXI to reflect its putative function. 

GmCHXI is a potential salt tolerance determinant in soybean. 

To further correlate the GmCHXI gene with salt tolerance, we 
analyse 23 previously re-sequenced soybean accessions (12 salt- 
tolerant; 11 salt-sensitive) 2 . The SNPs in the coding region of 
GmCHXI of the tolerant group, including W05, are conserved, 
whereas those of the sensitive group show varied genotypes 
(Fig. 3a,b). This difference is unique to only GmCHXI within the 
entire 388-Kb region, supporting the claim that GmCHXI is a 
putative causal gene for salt tolerance. More specifically, in the 
sensitive group, cultivated accessions C08 and C27 carry the same 
Tyl/copia retrotransposon as Williams 82 (Supplementary Fig. 8), 
and wild salt- sensitive accessions (W06, W07, W08 and W09) 
have at least two types of deletions in their GmCHXI exons 
(Fig. 3a). All SNPs within a 2-Kb promoter region of GmCHXI are 
highly conserved in the tolerant accessions, but are varied among 
the sensitive accessions (Fig. 3c). Real-time PCR data show that all 
the sensitive accessions analysed produce undetectable or very low 
levels of full-length GmCHXI transcripts as compared with the 



tolerant accessions (Fig. 3d and Supplementary Fig. 9). Therefore, 
the GmCHXI gene in the sensitive accessions is either non- 
functional or is expressed at very low levels. 

Rapid gain-of-function tests of GmCHXI. To further validate 
that GmCHXI is a salt tolerance determinant in WO 5 and other 
salt-tolerant accessions tested, we perform a gain-of-function test 
by expressing the GmCHXI cDNA from W05 in the hairy root 
culture of C08. Expression of the transgene is confirmed by real- 
time PCR (Fig. 4c). In the absence of NaCI treatment, both root 
cultures transformed with either GmCHXI or green fluorescent 
protein (GFP; control) give healthy hairy roots (Fig. 4a). How- 
ever, when subjected to NaCI treatments, roots transformed with 
GmCHXI show significantly higher root fresh weights than the 
control (Fig. 4b), demonstrating that GmCHXI from the major 
salt tolerance locus in W05 can alleviate salt stress. 

We further confirm the function of GmCHXI using transgenic 
tobacco BY- 2 cells. Consistent with the hairy root assay, 
transgenic BY-2 cells ectopically expressing GmCHXI show a 
higher survival rate under the treatment with 100 mM NaCI 
compared with the untransformed wild-type and the GFP 
transgenic control (Fig. 5a,b). Moreover, the GmCHXI transgenic 
lines also maintain a lower Na + /K + ratio under salt stress 
(Fig. 5c), supporting our hypothesis that GmCHXI is involved in 
ion homeostasis. 
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Figure 3 | The GmCHXI gene is conserved among salt-tolerant soybean germplasms. (a) Structural variations in the GmCHXl coding region of 
salt-sensitive germplasms, with salt-tolerant W05 as comparison, (b) SNP analyses and multiple amino-acid sequence alignments of the GmCHXI coding 
region from different soybean germplasms. X denotes ambiguous amino acid due to low coverage of re-sequencing data. Black indicates conserved 
residues, (c) SNP analyses and multiple alignments of a 2-Kb promoter region of GmCHXI from different soybean germplasms. Conserved bases are 
highlighted in black. R = G/A, Y = T/C. (d) Expression study of GmCHXI in soybean germplasms. Real-time PCR of the GmCHXI gene using primers 
specific to regions upstream (upper panel; for tolerant lines and C08) or downstream (lower panel; for all germplasms) referenced to the retrotransposon 
insertion site in C08, using RNA from salt-tolerant (green bars) and salt-sensitive germplasms (blue bars). N = 3. Error bar = s.e.m. 



Discussion 

One important goal of genomic and genetic studies of crop 
plants is to identify important loci and genes using unique 
germplasm resources, such as wild accessions, that can be 
used to improve agronomic traits and thereby agricultural 
productivity. Wild soybeans, in particular, are valuable genetic 
resources for improving soybean cultivation, owing to the rich 
collection and high genomic diversity of wild soybeans 1 ' 2 ' 7 , the 
absence of sexual reproduction barrier between wild and 
cultivated soybeans, and the availability of a reference genome 8 
and SNP information 2 . 

In this study, we further explore the wild soybean genome by 
performing de novo genome sequencing of a wild soybean 
accession (W05) that has been shown to be genetically distinct 
from the cultivated model soybean Williams 82 (ref. 2). A de novo 
genome can facilitate the identification of genetic materials from 
wild soybean for crop improvement because comparing de novo 
genomes is more effective in identifying small and large structural 
variations between genomes 24 ' 25 . To perform genetic analysis, we 
construct an RI population resulting from a cross between the de 
novo-sequenced W05 and the re-sequenced cultivated soybean 

6 



accession C08 (ref. 2), a close relative of Williams 82. By adopting 
the bin map strategy 26 via re-sequencing of a core panel of the RI 
lines, together with phenotypic analyses of the core panel, we 
complete the mapping of major QTLs that regulate important 
agronomic traits such as nutritional quality, yield and stress 
tolerance. The genomic spans of most identified QTLs are narrow 
enough for use in marker- assisted breeding. 

Correlation studies of different QTLs also reveal potential 
interesting relationships among different traits. For example, the 
QTL for seed anthocyanin content does not overlap with the 
major QTL for seed antioxidant content, thus disproving 
anthocyanin as the most important antioxidant in seeds. 
Furthermore, the QTL of the trailing growth character overlaps 
the QTLs for pod number per plant and seed number per plant. 
The higher pod number and seed number per plant in wild 
soybeans are likely a result of the trailing growth. These traits are 
important for wild soybeans to survive in the natural environ- 
ment by expanding their territories. However, human selection 
might have removed the trailing growth character in cultivated 
soybeans to ease the harvesting process and to increase crop 
density per unit area. 
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Figure 4 | Gain-of-function analysis of GmCHXI using the hairy root system, (a) Phenotypes of transgenic hairy roots expressing either GFP or 
GmCHXI with or without NaCI treatment. Photos were taken 2 weeks after treatment, (b) Fresh weight of hairy roots with or without NaCI treatment. 
N>12. Error bars, s.e.m. Data were analysed using Student's f-test. NS, not significant. ***P< 0.001. (c) Expression of transgenes validated by real-time 
PCR. Upper panel: GFP; lower panel: GmCHXI. N>12. Error bars = s.e.m. 
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Figure 5 | Gain-of-function analysis of GmCHXI using transgenic tobacco BY-2 cells, (a) Trypan blue staining of tobacco BY-2 cells under NaCI 
treatment. Four-day-old cells were treated with 100 mM NaCI or remained untreated in MS medium for 20 h before staining with Trypan blue. Nuclei of 
dead cells were stained blue. Scale bars, 100 |im. (b) Calculated survival rate based on the results of Trypan blue staining. The data were calculated 
from 14 randomly taken photos for each sample, (c) Na + /K + ratio of BY-2 cells under NaCI treatment. BY-2 cells were treated with 100 mM NaCI 
or remained untreated in MS medium for 4h. Ion contents were determined using atomic absorption spectrophotometry. N = 4. ND, non-detectable 
due to low signals. Numerical data in b and c were analysed using one-way analysis of variance followed by the Tukey's post hoc test (P<0.05). Error 
bars = s.e.m. Line no. 8 and Line no. 9 are two independent transgenic lines of GmCHXI WT, untransformed wild type; and GFP, BY-2 cells transformed with 
the GFP gene as a negative control. 



Salinization poses a severe threat to agricultural productivity, 
affecting more than 20% of irrigated lands globally . The wild 
soybean accession W05 exhibits a high tolerance towards salt 
stress and hence is a good genetic resource for improving salt 
tolerance in soybean. By exhausting the lines in our RI population 
for QTL mapping and extreme group study, we have identified a 
388-Kb major salt tolerance locus in the soybean genome. 



The exceptionally long linkage disequilibrium of the soybean 
genome 2 has posed a major obstacle in finding causal genes in a 
QTL. To identify the causal gene for salt tolerance within the 
3 88 -Kb region, we make use of a combination of genomic 
information and physiological data. The lower Na + distribution 
in the leaves of the salt-tolerant line W05 compared with that in 
the salt- sensitive C08 has drawn our attention to ion transporter 
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genes. By comparing the de novo genome sequencing data of the 
wild accession W05 to those of C08 and the reference genome 
Williams 82, we reveal the presence of a retrotransposon insertion 
in the ion transporter gene GmCHXl of the salt-sensitive C08 and 
Williams 82, whereas the salt-tolerant W05 retains an intact 
GmCHXl sequence. This kind of structural variation is hard to be 
detected without de novo sequencing data. Whole-genome-re- 
sequencing data are used to examine the changes in the gene 
structure of GmCHXl in genetically unrelated germplasms to 
validate the correlation between salt sensitivity and loss-of- 
function or low gene expression of GmCHXl. The gain-of- 
function tests have finally shown that GmCHXl can confer salt 
tolerance, probably via lowering of the Na + /K + ratio. 
Interestingly, the leaves of the salt-tolerant W05 also exhibit a 
lower Na + /K + ratio than the salt- sensitive C08. 

It appears that the functional GmCHXl gene is an ancestral 
trait and the sensitive accessions have accumulated different types 
of mutations either in the coding region (eliminating/ reducing 
the gene function) or the promoter region (lowering the 
expression). At this point, we cannot completely eliminate the 
possibility that the causative mutation might involve small RNA 
regulation or control by an adjacent transcription factor gene. 
However, these possibilities are less likely, given that the only 
annotated transcription factor gene located within the 3 88 -Kb 
region does not exhibit a major sequence difference between 
tolerant and sensitive parents. The elimination of GmCHXl in 
salt-sensitive germplasms may be an example of negative 
selection against a stress tolerance gene in unstressed environ- 
ments. The expression of stress tolerance genes can be an energy 
burden on the plant if the functions of these genes are not 
required. 

Although it has long been believed that CI ~ homeostasis has a 
critical role in soybean salt tolerance 28-30 , accumulated evidence 
shows that cation transporters are major salt tolerance determinants 
in plants. For example, high-affinity potassium transporters, which 
unload Na + from the root xylem, reduce Na + accumulation in the 
leaves of important crops such as rice and wheat 31 ' 32 . The CPAs 
constitute another important superfamily of ion transporters, which 
can be further divided into two clades: CPA1 and CPA2 (ref. 22). 
Members of the CPA1 subfamily such as Salt Overly Sensitive 1 
(SOS1) and Na + /H+ antiporters 1 (NHX1), which are localized 
on the plasma membrane and tonoplast, respectively, can protect 
plants by excluding or compartmentalizing Na + (refs 33, 34). The 
possible functions of NHX homologues isolated from soybean have 
also been reported 35 . CHXs are members of the CPA2 subfamily 
whose protective functions remain largely unclear. A few studies 
using other plant models implicate that some CHXs may be 
involved in Na + transportation and salt tolerance 36-38 . Here we 
provide gain-of-function data to confirm that the GmCHXl gene 
identified by our genomic and genetic studies confers salt tolerance. 

In summary, we have developed an efficient strategy using a 
combination of different whole-genome-sequencing approaches 
that can greatly enhance the efficiency of uncovering QTLs and 
genes for beneficial traits in crop breeding. By using the de novo 
sequencing data of a wild soybean genome and the re- sequencing 
data of soybean germplasms, we are able to identify a candidate 
causal gene for salt tolerance, validated via gain-of-function tests. 
In addition, the sequence information on the 868 -Mb de novo- 
assembled wild soybean genome and the mapping of 15 major 
QTLs resulting from this work will also benefit soybean 
researchers and breeders for further mining of the wild soybean 
genome. 

Methods 

W05 de novo sequencing and assembly. Eight libraries of different insertion 
sizes (180, 260, 326, 817, 2, 2.3, 6 and 10 Kb) were sequenced using the Illumina 



GAII platform. Raw reads were filtered to eliminate sequencing errors, by removing 
reads with N or polyA for > 10% of bases, low quality reads (reads from short 
insert-size libraries of 180-817 bp: with 50 or more bases having Phred-scaled 
quality score (Q- score) lower than or equal to 7; reads from large insert- size 
libraries of 2-10 Kb: with 15 or more bases having Q-score lower than or equal to 
7), reads contaminated with adapter sequence (with > 10 bp aligned to the 
adapter, allowing three mismatches at most), reads with wrong insertion size either 
due to paired or overlapping reads, and reads identical in both ends resulting 
from PCR amplification. The minimum average Q-score for each library is Q20 
after trimming. After data pre-processing, we obtained a total of 73.5 Gb clean 
reads. SOAPdenovo 9 was used to perform genome assembly. Clean data from 
small insert-size libraries were used to construct contigs, with K-mer set to 35 bp, 
and clean data from large insert- size libraries were used to connect contigs 
to form scaffolds according to the pair-end relationship. The scaffold gaps 
were then filled, with scaffolds of 200 bp or below being discarded from the 
final version. 



Repeat and non-coding RNA annotation. To annotate the transposable elements 
in the wild genome, de novo and homology prediction were applied. For the de 
novo annotation, RepeatModeler (part of RepeatMasker) and LTR_FINDER 39 were 
used to identify the de novo repeats to build a library for transposable elements. 
The de rcovo-based repeat library, soyTEdb 23 , and Repbase 40 library were then used 
to identify the repeats by homology using RepeatMasker. To annotate the tandem 
repeats, Tandem Repeat Finder 41 and RepeatMasker (parameter '-noint', http:// 
www.repeatmasker.org, version 3.2.9) were used. To annotate the protein-coding 
transposable elements, RepeatProteinMask (part of RepeatMasker) was used to 
search the repeat-related proteins against the transposable element protein 
database (http://www.repeatmasker.Org/RepeatProteinMask.html#database). 

tRNAscan-SE 42 and INFERNAL 43 were used to predict non-coding RNAs 
within the wild soybean genome using parameters for eukaryotic species. Pseudo- 
tRNA and tRNA genes overlapping with SINE repeat regions were discarded. To 
identify ribosomal RNA fragments, Blastn (E- value = le -5 , aligned length 
> 50 bp) was used to align potential ribosomal sequences with plant ribosomal 
RNAs. Both microRNA and small nuclear RNA genes were predicted by 
INFERNAL against the Rfam database 43 . 

Gene model prediction and annotation. De novo gene prediction, homology 
prediction and transcript sequence-based annotation were used to identify protein- 
coding genes. Augustus (version 2.5.5; ref. 44) and Glimmer-HMM (version 3.0.1; 
ref. 45) were used in the de novo gene prediction using the repeat masked scaffold 
sequence based on the HMM model, with parameters trained for Arabidopsis 
thaliana. Homologous proteins from related species were mapped to the genome 
using BLAST (version 2.2.23, £-value cutoff set at le - 5 ). Aligned sequences as well 
as its query proteins were filtered, and Gene Wise (version 2.2.0; ref. 46) was then 
used for searching accurately spliced alignments. EST sequences of G. max and 
G. soja (from NCBI) were aligned to scaffolds using BLAT (version 34, G. max: 
identity >0.93, coverage >0.8; G. soja: identity >0.95, coverage 0.8) to generate 
fragmental alignments. These alignments were linked together using PASA 
according to sequence overlaps 47 . RNA-Seq data were generated using three cDNA 
libraries (insertion size 200 bp) constructed from the total RNA of trifoliate and 
primary leaves, and roots of young seedlings of W05, and were sequenced with 
Illumina GAII to produce PE reads (75 bp for each read). A total of 7.5 Gb of data 
were produced. RNA-Seq data from W05 were aligned to scaffolds using TopHat 48 
to identify candidate exon regions. The donors and acceptors at the splicing sites 
were then identified. Finally, the transcripts were assembled using Cufflinks 
(version 2.1.1; ref. 48) for the EST and RNA-Seq data, which make up the third 
gene set. GLEAN 49 was then used to integrate all three gene predictions (de novo, 
homology-based and transcript-based) into a consensus gene set. 

Gene functions were assigned according to the best match of the alignments 
using Blastp to the SwissProt and TrEMBL databases (Uniprot release 2012_03). 
Motifs and domains were determined by InterProScan (version 4.7) against protein 
databases including ProDom, PRINTS, Pfam, SMART, PANTHER and PROSITE. 
Gene Ontology IDs were obtained from the corresponding InterPro entries. This 
Gene Ontology category was compared with that of the published Williams 82 
annotation (Supplementary Fig. 3). All genes were aligned against the Kyoto 
Encyclopedia of Genes and Genomes (KEGG) database (release 58) and the 
pathways in which the genes might be involved were derived from the matched 
genes in KEGG. 

To independently validate the annotation, the transcriptome sequencing data 
were assembled using Trinity r2013-02-25 with a minimum contig length of 200 
(ref. 10). TransDecoder (http://transdecoder.sourceforge.net/) was used to extract 
best protein-coding regions from the Trinity transcripts. Trinity contigs were 
mapped to the annotated genome using the following parameters: (i) E- 
value = le -50 ; (ii) identity >0.9; and (iii) the transcripts can be longer than the 
annotated sequence, but cannot be shorter than the annotated sequence by 100 bp. 

CEGMA (v2.4.010312) was used to access the genome completeness . 

Rl population construction and phenotype recording. The wild parental line 
W05 originated in Henan Province in China and the cultivated soybean C08 was a 
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close relative of Williams 82 from USA (http://wildsoydb.org/strains/soybean). 
W05 and C08 were reciprocally crossed to obtain Fl seeds. Fl plants were self- 
pollinated to obtain F2 seeds. Single- seed descendants were propagated from F2 to 
F7 to obtain a RI population consisting of 455 independent lines. Some segregants 
at higher generations were collected because of differences in agronomic traits, 
making a total RI population of 552 lines. Starting from F8, mixed seeds were 
collected and propagated for each line. A core panel of 96 RI lines that exhibited a 
diverse spectrum of growth period duration was selected for sequencing and 
mapping analyses. 

Eighteen phenotypic traits of this population were recorded in a greenhouse 
(22°25'7" N, 114°12 / 26 // E) or in open fields in an experimental farm (38° 35' 0" N, 
116° 48' 0" E) from 2008 to 2012. Detailed descriptions of the measurement 
method for each trait are provided in Supplementary Table 10. 

Population re-sequencing and SNP identification. The HiSeq 2000 sequencing 
platform was used to generate 1.24 G 75-mer pair-end reads of the selected core 
panel of 96 RI lines, with an average of 12.9 M reads for each RI line. Properly 
paired and uniquely mapped reads were then used to call SNPs. Reads containing 
> 10 Ns were filtered before alignment. SOAP2 (ref. 50) was used to map filtered 
reads back to the Williams 82 genome using the following parameters: seed length 
of 32 bp, maximum three mismatches allowed in each read, minimum aligned 
length of 35 bp to identify a mapped read. Reads that could not be properly paired 
or resulted in multiple alignment sites were discarded. 

SNPs of the population were called by SAMtools 51 with the following 
parameters: base quality of the SNP site no <30 and no >3 SNPs in any 10 bp 
window along the reference genome. Filtered SNPs were used to calculate the 
heterozygosity of each RI line using H = ((no. of heterozygous SNPs)/(no. of total 
SNPs)) x 100%. 



Genotyping and QTL identification. The SNPs were further filtered by the fol- 
lowing criteria: SNPs were of the two parental genotypes and homozygous, and 
there were at least 20 RI lines with SNPs at a single site. The maximum parsi- 
monious inference of recombination package was adopted to identify recombinant 
breakpoints and generate bins. The R/qtl package was used to calculate the genetic 
distance between markers in Kosambi mapping function 13 . A recombinant hot 
spot was identified between Chrl3 (physical position from 20,343,655 to 
33,600,100 bp) and Chrll (physical position from 23,866,522 to 39,171,759 bp) 
based on pair- wise recombination fractions and LOD scores (Supplementary 
Fig. 10). 

QTLs were identified using QTLCartographer (http://statgen.ncsu.edu/qtlcart/). 
Interval mapping and composite interval mapping were performed for each data 
set with a 10-cM scanning window and a 0.5-cM walking step. For the 
identification of major QTLs, the LOD score cutoff was determined by 1,000 times 
permutation at P = 0.05. The boundary of each major QTL was then denned by a 
LOD score-drop of 1.5 (ref. 52). The correlation between different phenotypes was 
calculated using the Predictive Analytics Suite Workstation Statistics (PASW 
Statistics 18.0.0). 

NaCI treatments and ion content analyses of soybean plants. Soybean seeds 
were germinated in a greenhouse on vermiculite. When the primary leaf appeared, 
the seedlings were transferred to a hydroponic system with half- strength Hoag- 
land's solution for 5 days. Salt treatment was applied by replacing the growth 
solution with fresh half-strength Hoagland's solution containing 0.9% NaCI (or no 
NaCI for the untreated control). For phenotype observations, the seedlings were 
treated for 6 days. For ion content analyses, tissues were collected at 24 or 72 h after 
treatment and ground in 1% acetic acid for soluble ion extraction. Na + and K + 
ion contents were analysed using the flame atomic absorption spectrophotometer 
(Hitachi Z2300) according to the manufacturer's instructions. 

Fine mapping of salt tolerance locus. Primers for SNP genotyping were designed 
based on the re- sequencing data 2 and the Williams 82 reference genome 8 using the 
ARMS-tetra primer method with the BatchPrimer3 programme . A list of primers 
used can be found in Supplementary Data 2. PCR was conducted in a 15-uI 
reaction mixture containing 1 x buffer, 3 mM MgCl 2 , 0.2 mM dNTPs, 0.3 uM of 
each primer, 100 ng genomic DNA template and 0.5 U GoTaq DNA polymerase 
(Promega Corp., Fitchburg). The thermal cycler was programmed with an initial 
2 min at 94 °C, followed by 35 cycles of 30 s at 94 °C, 30 s at the primer- specific 
annealing temperature and for a marker- specific extension time at 72 °C 
(Supplementary Data 2). After the thermal cycles, an additional 10 min at 72 °C 
was added to allow for the completion of extension. PCR products were resolved on 
2% agarose gel in 1 x TAE buffer. 

Analyses of GmCHXI structural variations. The consensus sequences of 
GmCHXl in different soybean germplasms were from previous mapping results 2 , 
with reference to the genomic coordinates of Williams 82, and were then translated 
into amino-acid sequences. Differences among the consensus sequences were 
examined after alignment by the ClustalW2 programme 54 . 



Retrotransposon insertions in the GmCHXl gene were amplified from genomic 
DNA and detected using PCR (Supplementary Figs 6 and 8). The Sanger method 
was used to confirm the sequence of the coding region of GmCHXl in different 
germplasms (tolerant: C01, C14, W01-W05 and Wll; sensitive: C08, C12 and 
W06-W10). 



Cloning and gene expression of GmCHXl. The cDNA of GmCHXl was obtained 
from total RNA extracted from the root. 3' RACE was performed using the 
SMARTer RACE cDNA Amplification Kit (Clontech, cat. no. 634923, CA) 
according to the manufacturer's instructions. 

Real-time PCR was conducted using the One Step SYBR PrimeScript RT-PCR 
Kit II (TaKaRa Biotechnology Co. Ltd, Dalian, China) according to the 
manufacturer's instruction, with a reaction volume of 20 ul containing 100 ng total 
RNA, using the CFX96 Touch Real-Time PCR Detection System (Bio-Rad, 
Hercules). Primers for real-time PCR were listed in Supplementary Data 2. The 
relative expression of target genes was calculated using the method 55 . 

Gain-of-function test of GmCHXl in hairy root system. The full-length coding 
sequence of GmCHXl from W05 was cloned into the binary vector V7 (ref. 56) 
between Xbal and Xhol sites downstream of the constitutive Cauliflower Mosaic 
Virus 35S promoter. As a negative control, the gene for the GFP was cloned instead 
of GmCHXl using the same vector and promoter. Both constructs were then 
transformed into the salt- sensitive parent C08. The soybean hairy root 
transformation and salt treatments were performed as previously described 57 with 
some modifications. Surface- sterilized soybean seeds were germinated on 
germination medium (15 mg 1 _ 1 NaH 2 P0 4 • H 2 0, 1 mM CaCl 2 • 2H 2 0, 25 mM 
KN0 3 , 1 mM (NH 4 ) 2 S0 4 , 1 mM MgS0 4 • 7H 2 0, 0.1 mM Na 2 EDTA • 2H 2 0, 
0. 1 mM FeS0 4 • 7H 2 0, 10 mg 1 ~ 1 MnS0 4 • 2H 2 0, 2 mg 1 ~ 1 ZnS0 4 • 7H 2 0, 
3 mg 1 ~ 1 H3BO3, 0.25 mg 1 ~ 1 Na 2 MoS0 4 • 2H 2 0, 0.025 mg 1 ~ 1 CuS0 4 • 5H 2 0, 
0.025 mgl _1 CoCl 2 -6H 2 0, 0.75 mgl _1 KI, 1 x B5 vitamin (10 mgl ~ 1 thiamine, 
1 mgl - 1 pyridoxal phosphate, 1 mgl - 1 nicotinic acid and 100 mgl - 1 myo- 
inositol), 2% sucrose, 0.6% agar, pH 5.8) for 4 days (16 h light/8 h dark). 
Agrobacterium rh.izogen.es strain K599 containing the recombinant constructs was 
grown in yeast extract peptone medium containing 50 mgl -1 kanamycin and 
200 uM acetosyringone at 28 °C for 16 h. It was then used to infect the cotyledons 
through scalpel incisions. The cotyledons were co-cultivated with A. rhizogenes in 
the dark for 5 days on moist filter paper. After that, the infected cotyledons were 
transferred to root-inducing medium (4.3 gl -1 Murashige and Skoog (MS) 
medium, 1 x B5 vitamin, 3% sucrose, 250 mgl -1 cefotaxime and 50 mgl -1 
kanamycin). After 2 weeks, cotyledons with roots emerging from the incision sites 
were transferred to new root-inducing medium with 100 mM NaCI or medium 
without NaCI as untreated control. Root mass was weighed about 2 weeks after 
treatment. 



Gain-of-function test of GmCHXl in transgenic tobacco BY-2 cells. The same 
recombinant constructs used in the hairy root system described above were used to 
transform tobacco BY-2 cells using A. tumefaciens strain LBA4404. For Trypan 
blue staining, 4-day-old BY-2 cells were suspended in MS medium with or without 
100 mM NaCI for 20 h. Cells were stained with Trypan blue (Sigma, cat. no. T8154) 
for 5 min before being observed under the microscope (Nikon E80i). To determine 
ion contents, 25 ml 4-day-old BY-2 cells were mixed with 25 ml MS medium with 
or without 200 mM NaCI (to make up to a final concentration of 100 mM) for 4 h. 
The cells were harvested and washed with 100 ml deionized water by suction 
nitration. Cells were microwaved for 10 s in 5 ml 1% acetic acid at 1,100 W three 
times. Cell lysates were then centrifuged at 13,000g for 10 min and subjected to ion 
content analysis. 

Statistical analyses. The effects of salt treatment on Na + accumulation in the 
leaves of W05 and C08, the survival rate and ion content of tobacco BY-2 cells 
under NaCI treatment were analysed using a one-way analysis of variance followed 
by the Tukey's post hoc test at P<0.05. The effects of salt treatment on the root 
fresh weight of C08 ectopically expressing GmCHXl from W05 or GFP were 
analysed using the Student's f-test. 
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